Overview

Dataset statistics

Number of variables: 10
Number of observations: 380286
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 144.7 MiB
Average record size in memory: 398.9 B

Variable types

CAT: 5
NUM: 5

Reproduction

Analysis started: 2020-03-19 05:03:58.835994
Analysis finished: 2020-03-19 05:06:14.199396
Version: pandas-profiling v2.5.0
Command line: pandas_profiling --config_file config.yaml [YOUR_FILE.csv]
Download configuration: config.yaml

Warnings

StockCode has a high cardinality: 3630 distinct values (High cardinality)
Description has a high cardinality: 3830 distinct values (High cardinality)
InvoiceDate has a high cardinality: 16672 distinct values (High cardinality)
date has a high cardinality: 298 distinct values (High cardinality)
TotalCost is highly correlated with Quantity (High correlation)
Quantity is highly correlated with TotalCost (High correlation)
Quantity is highly skewed (γ1 = 534.8394849) (Skewed)
UnitPrice is highly skewed (γ1 = 209.4450002) (Skewed)
TotalCost is highly skewed (γ1 = 443.0241439) (Skewed)
InvoiceDate only contains datetime values, but is categorical. Consider applying pd.to_datetime() (Type)
date only contains datetime values, but is categorical. Consider applying pd.to_datetime() (Type)
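
The two Type warnings above recommend pd.to_datetime(). A minimal sketch of that conversion follows. It assumes a DataFrame with the column names InvoiceDate and date and the string formats visible in this report; the two rows are copied from the sample section, and the explicit format strings are an assumption about the full data.

```python
import pandas as pd

# Two rows in the string formats shown in this report's sample section.
df = pd.DataFrame({
    "InvoiceDate": ["2010-12-09 08:34:00", "2011-12-09 12:50:00"],
    "date": ["2010-12-09", "2011-12-09"],
})

# Explicit formats (assumed) make malformed values raise instead of being guessed.
df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"], format="%Y-%m-%d %H:%M:%S")
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

print(df.dtypes)
```

After the conversion, both columns should be profiled as dates rather than as high-cardinality categoricals.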

Variables

InvoiceNo
Real number (ℝ≥0)

Distinct count: 17858
Unique (%): 4.7%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 561358.8398600001
Minimum: 537879
Maximum: 581587
Zeros: 0
Zeros (%): 0.0%
Memory size: 2.9 MiB

Quantile statistics

Minimum: 537879
5-th percentile: 540647
Q1: 550339
median: 562539
Q3: 572295
95-th percentile: 579529.75
Maximum: 581587
Range: 43708
Interquartile range (IQR): 21956

Descriptive statistics

Standard deviation: 12580.59317
Coefficient of variation (CV): 0.02241096474
Kurtosis: -1.197917943
Mean: 561358.8399
Median Absolute Deviation (MAD): 10938.36662
Skewness: -0.1748974811
Sum: 2.134769078e+11
Variance: 158271324.5
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[537879. 537887.5 537889.5 537895.5 537900.5 ... 581566.5 581576. 581580.5 581584.5 581587. ], "bayesian blocks" binning strategy used)
Value                 Count   Frequency (%)
576339                542     0.1%
579196                533     0.1%
580727                529     0.1%
578270                442     0.1%
573576                435     0.1%
567656                421     0.1%
567183                392     0.1%
575607                377     0.1%
571441                364     0.1%
570488                353     0.1%
Other values (17848)  375898  98.8%

Value                 Count   Frequency (%)
537879                8       < 0.1%
537880                16      < 0.1%
537881                12      < 0.1%
537882                4       < 0.1%
537883                5       < 0.1%

Value                 Count   Frequency (%)
581587                15      < 0.1%
581586                4       < 0.1%
581585                21      < 0.1%
581584                2       < 0.1%
581583                2       < 0.1%

StockCode
Categorical

HIGH CARDINALITY
Distinct count: 3630
Unique (%): 1.0%
Missing: 0
Missing (%): 0.0%
Memory size: 2.9 MiB
Value                 Count   Frequency (%)
85123A                1931    0.5%
22423                 1649    0.4%
85099B                1569    0.4%
47566                 1382    0.4%
84879                 1342    0.4%
20725                 1277    0.3%
22720                 1152    0.3%
23203                 1091    0.3%
POST                  1073    0.3%
20727                 1052    0.3%
Other values (3620)   366768  96.4%

Length

Max length: 12
Mean length: 5.077039386
Min length: 1

Value                 Count   Frequency (%)
Uppercase_Letter      24      68.6%
Decimal_Number        10      28.6%
Space_Separator       1       2.9%

Value                 Count   Frequency (%)
Latin                 24      68.6%
Common                11      31.4%

Value                 Count   Frequency (%)
ASCII                 35      100.0%

Description
Categorical

HIGH CARDINALITY
Distinct count: 3830
Unique (%): 1.0%
Missing: 0
Missing (%): 0.0%
Memory size: 2.9 MiB
Value                                 Count   Frequency (%)
WHITE HANGING HEART T-LIGHT HOLDER    1924    0.5%
REGENCY CAKESTAND 3 TIER              1649    0.4%
JUMBO BAG RED RETROSPOT               1569    0.4%
PARTY BUNTING                         1382    0.4%
ASSORTED COLOUR BIRD ORNAMENT         1342    0.4%
LUNCH BAG RED RETROSPOT               1276    0.3%
SET OF 3 CAKE TINS PANTRY DESIGN      1152    0.3%
POSTAGE                               1073    0.3%
LUNCH BAG BLACK SKULL.                1052    0.3%
SPOTTY BUNTING                        1014    0.3%
Other values (3820)                   366853  96.5%

Length

Max length: 35
Mean length: 26.66501265
Min length: 6

Value                 Count   Frequency (%)
Uppercase_Letter      26      38.2%
Lowercase_Letter      20      29.4%
Decimal_Number        10      14.7%
Other_Punctuation     7       10.3%
Dash_Punctuation      1       1.5%
Math_Symbol           1       1.5%
Open_Punctuation      1       1.5%
Close_Punctuation     1       1.5%
Space_Separator       1       1.5%

Value                 Count   Frequency (%)
Latin                 46      67.6%
Common                22      32.4%

Value                 Count   Frequency (%)
ASCII                 68      100.0%

Quantity
Real number (ℝ≥0)

HIGH CORRELATION
SKEWED
Distinct count: 347
Unique (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 25.498974456067277
Minimum: 1
Maximum: 323980
Zeros: 0
Zeros (%): 0.0%
Memory size: 2.9 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 4
median: 8
Q3: 24
95-th percentile: 96
Maximum: 323980
Range: 323979
Interquartile range (IQR): 20

Descriptive statistics

Standard deviation: 553.8243856
Coefficient of variation (CV): 21.71947686
Kurtosis: 308880.4755
Mean: 25.49897446
Median Absolute Deviation (MAD): 28.48140828
Skewness: 534.8394849
Sum: 9696903
Variance: 306721.4501
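
The descriptive statistics above are related by simple identities: CV is the standard deviation divided by the mean, and the variance is the squared standard deviation. The sketch below checks both using the Quantity figures copied from this report. It also contrasts a mean absolute deviation with a median absolute deviation on a small hypothetical sample. The reported MAD of 28.48 is larger than the distance from the median (8) to either quartile (4 and 24), which a median absolute deviation cannot be, so the figure may in fact be a mean absolute deviation. That is an inference from the numbers, not something the report states.

```python
import math
import statistics

# Values copied from the Quantity statistics in this report.
mean, std, variance = 25.49897446, 553.8243856, 306721.4501

cv = std / mean  # coefficient of variation
print(round(cv, 4), math.isclose(std ** 2, variance, rel_tol=1e-6))


def mean_abs_dev(xs):
    """Mean of absolute deviations from the mean."""
    m = statistics.fmean(xs)
    return statistics.fmean([abs(x - m) for x in xs])


def median_abs_dev(xs):
    """Median of absolute deviations from the median."""
    med = statistics.median(xs)
    return statistics.median([abs(x - med) for x in xs])


# Hypothetical heavy-tailed sample (not taken from the dataset).
sample = [1, 4, 8, 24, 1000]
print(mean_abs_dev(sample), median_abs_dev(sample))
```

On heavy-tailed data such as this, the two deviation measures can differ by orders of magnitude, so it matters which one a label refers to.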
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[1.0000e+00 1.5000e+00 2.5000e+00 3.5000e+00 4.5000e+00 ... 2.5000e+03 4.0400e+03 6.2800e+03 1.0800e+04 3.2398e+05], "bayesian blocks" binning strategy used)
Value                 Count   Frequency (%)
12                    49457   13.0%
4                     47416   12.5%
1                     39286   10.3%
2                     36381   9.6%
24                    27207   7.2%
8                     27189   7.1%
6                     24959   6.6%
48                    22080   5.8%
3                     16639   4.4%
10                    14901   3.9%
Other values (337)    74771   19.7%

Value                 Count   Frequency (%)
1                     39286   10.3%
2                     36381   9.6%
3                     16639   4.4%
4                     47416   12.5%
5                     4867    1.3%

Value                 Count   Frequency (%)
323980                1       < 0.1%
74215                 1       < 0.1%
50160                 1       < 0.1%
19200                 1       < 0.1%
12000                 1       < 0.1%

InvoiceDate
Categorical

HIGH CARDINALITY
TYPE DATE
Distinct count: 16672
Unique (%): 4.4%
Missing: 0
Missing (%): 0.0%
Memory size: 2.9 MiB
Value                 Count   Frequency (%)
2011-11-14 15:27:00   542     0.1%
2011-11-28 15:54:00   533     0.1%
2011-12-05 17:17:00   529     0.1%
2011-11-23 13:39:00   443     0.1%
2011-10-31 14:09:00   435     0.1%
2011-09-21 14:40:00   421     0.1%
2011-11-10 12:37:00   377     0.1%
2011-10-17 13:31:00   364     0.1%
2011-10-10 17:12:00   353     0.1%
2011-10-24 17:07:00   352     0.1%
Other values (16662)  375937  98.9%

Length

Max length: 19
Mean length: 19
Min length: 19

Value                 Count   Frequency (%)
Decimal_Number        10      76.9%
Dash_Punctuation      1       7.7%
Other_Punctuation     1       7.7%
Space_Separator       1       7.7%

Value                 Count   Frequency (%)
Common                13      100.0%

Value                 Count   Frequency (%)
ASCII                 13      100.0%

UnitPrice
Real number (ℝ≥0)

SKEWED
Distinct count: 633
Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.002069543448877
Minimum: 0.0
Maximum: 8142.75
Zeros: 39
Zeros (%): < 0.1%
Memory size: 2.9 MiB

Quantile statistics

Minimum: 0
5-th percentile: 0.3696
Q1: 1.1
median: 1.716
Q3: 3.3
95-th percentile: 8.5
Maximum: 8142.75
Range: 8142.75
Interquartile range (IQR): 2.2
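
The quantile statistics above follow two identities: the range is the maximum minus the minimum, and the IQR is Q3 minus Q1 (here 3.3 - 1.1 = 2.2). The following is a minimal sketch of how such a summary could be computed with pandas. The helper name quantile_summary is hypothetical, the short series reuses a few UnitPrice values from this report purely as illustration, and pandas' default linear interpolation is assumed.

```python
import pandas as pd


def quantile_summary(s: pd.Series) -> dict:
    """Summary in the style of the quantile statistics block (hypothetical helper)."""
    q = s.quantile([0.05, 0.25, 0.5, 0.75, 0.95])
    return {
        "min": s.min(),
        "5%": q.loc[0.05],
        "Q1": q.loc[0.25],
        "median": q.loc[0.5],
        "Q3": q.loc[0.75],
        "95%": q.loc[0.95],
        "max": s.max(),
        "range": s.max() - s.min(),      # maximum minus minimum
        "iqr": q.loc[0.75] - q.loc[0.25],  # Q3 minus Q1
    }


# A few UnitPrice values that appear in this report, used only as illustration.
prices = pd.Series([0.42, 0.85, 1.25, 1.65, 2.95, 4.95])
summary = quantile_summary(prices)
print(summary)
```

Because the IQR ignores the tails, it stays small for UnitPrice even though the range is dominated by a single extreme value.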

Descriptive statistics

Standard deviation21.50586539
Coefficient of variation (CV)7.163679947
Kurtosis62982.48651
Mean3.002069543
Median Absolute Deviation (MAD)2.30487974
Skewness209.4450002
Sum1141645.018
Variance462.5022461
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0.000000e+00 5.000000e-04 1.810000e-02 3.760000e-02 5.640000e-02 ... 2.940000e+02 2.966688e+02 8.259904e+02 2.638618e+03 8.142750e+03], "bayesian blocks" binning strategy used)
Value                 Count   Frequency (%)
1.25                  28261   7.4%
1.65                  24169   6.4%
2.95                  18102   4.8%
0.85                  17584   4.6%
1.1                   15349   4.0%
0.42                  14799   3.9%
4.95                  11303   3.0%
1.452                 11102   2.9%
3.75                  10695   2.8%
2.1                   10434   2.7%
Other values (623)    218488  57.5%

Value                 Count   Frequency (%)
0                     39      < 0.1%
0.001                 4       < 0.1%
0.0352                27      < 0.1%
0.04                  39      < 0.1%
0.0528                53      < 0.1%

Value                 Count   Frequency (%)
8142.75               1       < 0.1%
3661.7328             2       < 0.1%
3475.4016             1       < 0.1%
2777.236              1       < 0.1%
2500                  1       < 0.1%

CustomerID
Real number (ℝ≥0)

Distinct count: 4275
Unique (%): 1.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 15277.57957432038
Minimum: 12346.0
Maximum: 18287.0
Zeros: 0
Zeros (%): 0.0%
Memory size: 2.9 MiB

Quantile statistics

Minimum: 12346
5-th percentile: 12624
Q1: 13949
median: 15132
Q3: 16779
95-th percentile: 17891
Maximum: 18287
Range: 5941
Interquartile range (IQR): 2830

Descriptive statistics

Standard deviation: 1710.601133
Coefficient of variation (CV): 0.1119680722
Kurtosis: -1.176653792
Mean: 15277.57957
Median Absolute Deviation (MAD): 1475.561052
Skewness: 0.03806843411
Sum: 5809849626
Variance: 2926156.236
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[12346. 12346.5 12347.5 12348.5 12349.5 ... 18271. 18272.5 18282.5 18285. 18287. ], "bayesian blocks" binning strategy used)
Value                 Count   Frequency (%)
17841                 7537    2.0%
14911                 5543    1.5%
14096                 5111    1.3%
12748                 4063    1.1%
14606                 2557    0.7%
15311                 2282    0.6%
14646                 2080    0.5%
13089                 1773    0.5%
13263                 1667    0.4%
14298                 1637    0.4%
Other values (4265)   346036  91.0%

Value                 Count   Frequency (%)
12346                 1       < 0.1%
12347                 151     < 0.1%
12348                 31      < 0.1%
12349                 73      < 0.1%
12350                 17      < 0.1%

Value                 Count   Frequency (%)
18287                 70      < 0.1%
18283                 721     0.2%
18282                 12      < 0.1%
18281                 7       < 0.1%
18280                 10      < 0.1%

Country
Categorical

Distinct count: 36
Unique (%): < 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.9 MiB
Value                 Count   Frequency (%)
United Kingdom        337670  88.8%
Germany               8823    2.3%
France                8149    2.1%
EIRE                  7096    1.9%
Spain                 2475    0.7%
Netherlands           2361    0.6%
Belgium               2019    0.5%
Switzerland           1836    0.5%
Portugal              1386    0.4%
Australia             1162    0.3%
Other values (26)     7309    1.9%

Length

Max length: 20
Mean length: 13.18699084
Min length: 3

Value                 Count   Frequency (%)
Lowercase_Letter      22      55.0%
Uppercase_Letter      17      42.5%
Space_Separator       1       2.5%

Value                 Count   Frequency (%)
Latin                 39      97.5%
Common                1       2.5%

Value                 Count   Frequency (%)
ASCII                 40      100.0%

date
Categorical

HIGH CARDINALITY
TYPE DATE
Distinct count: 298
Unique (%): 0.1%
Missing: 0
Missing (%): 0.0%
Memory size: 2.9 MiB
Value                 Count   Frequency (%)
2011-11-06            3340    0.9%
2011-12-05            3335    0.9%
2011-11-23            3216    0.8%
2011-11-10            3104    0.8%
2011-11-20            3003    0.8%
2011-11-17            2945    0.8%
2011-11-14            2854    0.8%
2011-10-30            2850    0.7%
2011-11-22            2758    0.7%
2011-11-28            2708    0.7%
Other values (288)    350173  92.1%

Length

Max length: 10
Mean length: 10
Min length: 10

Value                 Count   Frequency (%)
Decimal_Number        10      90.9%
Dash_Punctuation      1       9.1%

Value                 Count   Frequency (%)
Common                11      100.0%

Value                 Count   Frequency (%)
ASCII                 11      100.0%

TotalCost
Real number (ℝ≥0)

HIGH CORRELATION
SKEWED
Distinct count: 2778
Unique (%): 0.7%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 22.713856949769387
Minimum: 0.0
Maximum: 168469.6
Zeros: 39
Zeros (%): < 0.1%
Memory size: 2.9 MiB

Quantile statistics

Minimum: 0
5-th percentile: 1.25
Q1: 4.95
median: 12.48
Q3: 19.8
95-th percentile: 67.6
Maximum: 168469.6
Range: 168469.6
Interquartile range (IQR): 14.85

Descriptive statistics

Standard deviation: 315.7374474
Coefficient of variation (CV): 13.90065316
Kurtosis: 223023.0172
Mean: 22.71385695
Median Absolute Deviation (MAD): 21.06027898
Skewness: 443.0241439
Sum: 8637761.804
Variance: 99690.13567
Histogram with fixed size bins (bins=10)
Histogram with variable size bins (bins=[0.000000e+00 5.000000e-04 1.100000e-01 1.300000e-01 1.850000e-01 ... 2.126400e+03 3.300000e+03 4.956750e+03 7.643735e+03 1.684696e+05], "bayesian blocks" binning strategy used)
Value                 Count   Frequency (%)
15                    19465   5.1%
19.8                  10812   2.8%
17.7                  8821    2.3%
16.5                  8399    2.2%
10.2                  7722    2.0%
1.25                  6263    1.6%
3.75                  6115    1.6%
20.8                  5494    1.4%
10.5                  5414    1.4%
2.5                   5298    1.4%
Other values (2768)   296483  78.0%

Value                 Count   Frequency (%)
0                     39      < 0.1%
0.001                 4       < 0.1%
0.06                  1       < 0.1%
0.08                  1       < 0.1%
0.1                   3       < 0.1%

Value                 Count   Frequency (%)
168469.6              1       < 0.1%
77183.6               1       < 0.1%
38970                 1       < 0.1%
8142.75               1       < 0.1%
7144.72               1       < 0.1%

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1: -1 indicates total negative linear correlation, 0 indicates no linear correlation, and +1 indicates total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
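
The calculation described above can be sketched in plain Python. The function name pearson_r is illustrative, and the five Quantity and TotalCost pairs are copied from the first rows of the sample in this report, so the resulting r says nothing about the full dataset.

```python
import math


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (std_x * std_y)


# First five (Quantity, TotalCost) pairs from the "First rows" sample.
quantity = [2, 12, 6, 8, 4]
total_cost = [15.9, 47.4, 12.6, 37.2, 31.8]

r = pearson_r(quantity, total_cost)
# Shifting and (positively) rescaling one variable leaves r unchanged.
r_shifted = pearson_r([10 * q + 5 for q in quantity], total_cost)
print(r, r_shifted)
```

The n - 1 factors cancel between numerator and denominator, so using population instead of sample estimates gives the same r.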

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at capturing nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1: -1 indicates total negative monotonic correlation, 0 indicates no monotonic correlation, and +1 indicates total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
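
As a sketch of that definition, the code below ranks both variables (tied values share the average of their positions, a common convention assumed here) and then applies the covariance-over-standard-deviations formula to the ranks. The function names and the cubic example data are illustrative only.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Pearson's r computed on the rank variables."""
    return pearson_r(average_ranks(x), average_ranks(y))


# A monotonic but nonlinear relationship: y = x ** 3.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(spearman_rho(x, y), pearson_r(x, y))
```

For this monotonic cubic relationship the ranks agree exactly, while Pearson's r on the raw values falls short of 1.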

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1: -1 indicates total negative correlation, 0 indicates no correlation, and +1 indicates total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
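
The pair-counting definition above corresponds to the variant usually called tau-a. A minimal sketch follows, with an illustrative function name and toy data. Library implementations such as scipy.stats.kendalltau default to tau-b, which adjusts the denominator for ties, so their results can differ from this sketch when ties are present.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs; tied pairs count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Toy data: one swapped pair out of six gives 5 concordant and 1 discordant pair.
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
print(tau)
```

The double loop is quadratic in the number of observations, which is fine for illustration but slow for a dataset of this size.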

Missing values

Sample

First rows

    InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country         date        TotalCost
0   537879     21524      DOORMAT SPOTTY HOME SWEET HOME       2         2010-12-09 08:34:00  7.95       14243.0     United Kingdom  2010-12-09  15.9
1   537879     22114      HOT WATER BOTTLE TEA AND SYMPATHY    12        2010-12-09 08:34:00  3.95       14243.0     United Kingdom  2010-12-09  47.4
2   537879     22694      WICKER STAR                          6         2010-12-09 08:34:00  2.10       14243.0     United Kingdom  2010-12-09  12.6
3   537879     22835      HOT WATER BOTTLE I AM SO POORLY      8         2010-12-09 08:34:00  4.65       14243.0     United Kingdom  2010-12-09  37.2
4   537879     85048      15CM CHRISTMAS GLASS BALL 20 LIGHTS  4         2010-12-09 08:34:00  7.95       14243.0     United Kingdom  2010-12-09  31.8
5   537879     85150      LADIES & GENTLEMEN METAL SIGN        6         2010-12-09 08:34:00  2.55       14243.0     United Kingdom  2010-12-09  15.3
6   537879     82494L     WOODEN FRAME ANTIQUE WHITE           12        2010-12-09 08:34:00  2.95       14243.0     United Kingdom  2010-12-09  35.4
7   537879     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         2010-12-09 08:34:00  2.95       14243.0     United Kingdom  2010-12-09  17.7
8   537880     85150      LADIES & GENTLEMEN METAL SIGN        12        2010-12-09 09:14:00  2.55       12963.0     United Kingdom  2010-12-09  30.6
9   537880     48187      DOORMAT NEW ENGLAND                  2         2010-12-09 09:14:00  7.95       12963.0     United Kingdom  2010-12-09  15.9

Last rows

        InvoiceNo  StockCode  Description                      Quantity  InvoiceDate          UnitPrice  CustomerID  Country  date        TotalCost
380276  581587     22726      ALARM CLOCK BAKELIKE GREEN       16        2011-12-09 12:50:00  3.300      12680.0     France   2011-12-09  15.00
380277  581587     22138      BAKING SET 9 PIECE RETROSPOT     12        2011-12-09 12:50:00  4.356      12680.0     France   2011-12-09  14.85
380278  581587     22629      SPACEBOY LUNCH BOX               48        2011-12-09 12:50:00  1.716      12680.0     France   2011-12-09  23.40
380279  581587     22613      PACK OF 20 SPACEBOY NAPKINS      48        2011-12-09 12:50:00  0.748      12680.0     France   2011-12-09  10.20
380280  581587     22556      PLASTERS IN TIN CIRCUS PARADE    48        2011-12-09 12:50:00  1.452      12680.0     France   2011-12-09  19.80
380281  581587     22555      PLASTERS IN TIN STRONGMAN        48        2011-12-09 12:50:00  1.452      12680.0     France   2011-12-09  19.80
380282  581587     22367      CHILDRENS APRON SPACEBOY DESIGN  32        2011-12-09 12:50:00  1.716      12680.0     France   2011-12-09  15.60
380283  581587     23255      CHILDRENS CUTLERY CIRCUS PARADE  16        2011-12-09 12:50:00  3.652      12680.0     France   2011-12-09  16.60
380284  581587     22631      CIRCUS PARADE LUNCH BOX          48        2011-12-09 12:50:00  1.716      12680.0     France   2011-12-09  23.40
380285  581587     23256      CHILDRENS CUTLERY SPACEBOY       16        2011-12-09 12:50:00  3.652      12680.0     France   2011-12-09  16.60